M01: Lecture Note 2
Language, Probability, and Generative Systems
1 Text Analytics and Sentiment Analysis
1.1 Introduction to Text Analytics
Text analytics is the discipline concerned with extracting meaningful information, patterns, and insights from unstructured text data. In today’s digital world, vast amounts of information are generated in textual form — emails, social media posts, customer reviews, reports, and more. Text analytics provides the computational tools and methodologies to transform this raw text into structured knowledge that organizations can use for decision‑making, automation, and research.
The figure below provides a high‑level overview of the text mining landscape, illustrating how raw text is processed, analyzed, and converted into actionable insights.
Text analytics draws from multiple fields — including information retrieval, machine learning, statistics, and linguistics — to build systems that can understand and interpret human language at scale. It is often used interchangeably with text mining, though the two terms have subtle differences that we will clarify shortly.
1.2 Text Mining Process
Text mining refers to the computational process of discovering patterns, extracting information, and generating structured representations from unstructured text. The workflow typically involves several stages, from lexical processing (working with characters and tokens) to structural and semantic interpretation (building syntax trees, extracting entities, and constructing knowledge bases).
The diagram below illustrates a typical NLP pipeline used in text mining systems:
This pipeline highlights three major layers:
1.2.1 Lexical Processing
Lexical processing focuses on the surface form of text — characters, words, and tokens.
- Characters represent the raw textual input.
- Tokens are meaningful units such as words or punctuation marks.
- Tagged tokens include additional linguistic information such as part‑of‑speech tags.
This stage prepares the text for deeper syntactic and semantic analysis.
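As a minimal sketch of the character-to-token step, a regular expression can split raw text into word and punctuation tokens. The `tokenize` helper below is illustrative, not part of any standard toolkit:

```python
import re

def tokenize(text: str) -> list[str]:
    # Split the raw character stream into word tokens and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Apple acquired Beats."))
# ['Apple', 'acquired', 'Beats', '.']
```

A real lexical layer would add part-of-speech tags to these tokens before passing them to the structural stage.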
1.2.2 Structural Representation
Once text is tokenized and annotated, the system constructs higher‑level structures:
- Syntax trees capture grammatical relationships between words.

- Entity relationships identify connections between named entities (e.g., “Apple acquired Beats”).
- Knowledge bases store structured facts extracted from text.
These representations enable downstream tasks such as information extraction and reasoning.
1.2.3 Algorithmic Components
The pipeline integrates algorithmic modules such as:
- Regular expressions for pattern matching
- Part‑of‑speech taggers
- Logic compilers for rule‑based reasoning
- Information extractors for identifying entities, events, and relations
These components interact with the structural layers to produce meaningful outputs.
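To make the regular-expression component concrete, the sketch below extracts (acquirer, target) pairs from "X acquired Y" phrases, echoing the earlier example. The pattern is deliberately simplistic and illustrative only:

```python
import re

# Illustrative pattern: two capitalized words joined by the verb "acquired"
ACQ_PATTERN = re.compile(r"([A-Z][a-z]+) acquired ([A-Z][a-z]+)")

def extract_acquisitions(text: str) -> list[tuple[str, str]]:
    # Return every (acquirer, target) pair the pattern matches
    return ACQ_PATTERN.findall(text)

print(extract_acquisitions("Apple acquired Beats. Google acquired Fitbit."))
# [('Apple', 'Beats'), ('Google', 'Fitbit')]
```

Production information extractors replace such hand-written patterns with taggers and learned models, but the input/output contract is the same.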
1.3 Text Analytics: Definitions and Scope
Text analytics is a broader umbrella term that encompasses text mining as well as other analytical processes. It includes tasks such as information retrieval, summarization, classification, and visualization. The relationship between the two can be expressed simply as: text mining ⊂ text analytics.
In other words:
- Text mining focuses on extracting structured information from text.
- Text analytics includes text mining plus the broader ecosystem of tools used to search, retrieve, and analyze text at scale.
1.4 Application Areas of Text Mining
Text mining supports a wide range of applications across industries. Some of the most common include:
1.4.1 Information Extraction
Automatically identifying entities, relationships, and events from text.
1.4.2 Topic Tracking
Monitoring how topics evolve over time in news, social media, or research literature.
1.4.3 Summarization
Generating concise summaries of long documents, either extractively or abstractively.
1.4.4 Categorization
Assigning documents to predefined categories (e.g., spam detection, news classification).
1.4.5 Clustering
Grouping similar documents without predefined labels.
1.4.6 Concept Linking
Connecting related concepts across documents to reveal hidden associations.
1.4.7 Question Answering
Building systems that can answer natural‑language questions using textual data.
These applications demonstrate the versatility of text mining in both academic and commercial settings.
1.5 Text Mining and Analytics Pipeline
The following figure provides a general overview of the NLP pipeline that underlies most text mining and analytics systems:
This pipeline typically includes:
- Text preprocessing
- Feature extraction
- Model training or rule‑based analysis
- Evaluation and deployment
Each stage builds upon the previous one to transform raw text into structured insights.
1.6 Sentiment Analysis
Sentiment analysis is one of the most widely used applications of text mining. It aims to determine the emotional tone or subjective opinion expressed in text. This is especially valuable in domains such as marketing, customer service, finance, and social media analytics.
The figure below illustrates the major tasks, tools, and methods used in sentiment analysis:
1.6.1 Methods
Sentiment analysis can be approached using several methodological families:
- Lexicon‑based methods, which rely on predefined sentiment dictionaries
- Machine learning methods, which learn patterns from labeled data
- Deep learning methods, which use neural networks to capture complex linguistic patterns
- Hybrid methods, which combine lexicons with machine learning for improved accuracy
1.6.2 Applications
Sentiment analysis is used in:
- Domain‑specific applications such as product reviews or political analysis
- Large language model pipelines, where sentiment signals can guide downstream tasks
1.6.3 Challenges
Two major categories of challenges arise:
- Methodological challenges, such as handling sarcasm or domain adaptation
- Text context challenges, including ambiguity, negation, and cultural variation
1.7 Sentiment Classification Algorithms
The following DOT diagram summarizes the major families of sentiment classification algorithms:
This taxonomy divides sentiment analysis approaches into two broad categories:
1.7.1 Machine Learning Approaches
These include:
- Supervised learning, where models learn from labeled examples. Common families include:
  - Decision tree classifiers
  - Linear classifiers (e.g., SVMs and neural networks, including deep learning models)
  - Rule‑based classifiers
  - Probabilistic models (e.g., Naïve Bayes, Bayesian networks, and maximum entropy models)
- Unsupervised learning, where models infer structure without labels
1.7.2 Lexicon‑Based Approaches
These rely on sentiment dictionaries or corpus‑derived lexicons.
Two major subtypes include:
- Dictionary‑based approaches, which use curated word lists
- Corpus‑based approaches, which infer sentiment from statistical or semantic patterns in large corpora
1.8 Types of Sentiment Analysis
Sentiment analysis can be specialized into several sub‑tasks:
- Aspect‑based sentiment analysis, which identifies sentiment toward specific product attributes
- Emotion‑based analysis, which classifies emotions such as joy, anger, or fear
- Fine‑grained sentiment analysis, which assigns sentiment scores on a multi‑point scale
- Intent‑based analysis, which infers user intentions behind the text
The following sequence of images illustrates these types:
3 Natural Language Processing (NLP)
Natural Language Processing (NLP) is the field that enables computers to process, understand, generate, and interact with human language. NLP systems bridge raw text and computational models, allowing machines to interpret meaning, perform tasks, and generate coherent language.
Key capabilities include:
- Learning useful representations: encoding text into structured forms (e.g., embeddings) that capture meaning
- Generating language: producing text for tasks such as translation, summarization, or dialogue
- Connecting language and action: enabling systems to use language to perform tasks, reason, or interact with environments
4 General NLP Framework
At its core, NLP involves learning a function f that maps an input x to an output y, where either or both involve language. The table below illustrates common NLP tasks:
This framework covers tasks such as language modeling, translation, classification, linguistic analysis, and multimodal tasks like image captioning.
5 Building NLP Systems
NLP systems can be built in several ways, ranging from rule‑based approaches to modern machine learning and prompting.
5.0.1 Rule‑based Systems
These rely on manually crafted rules:
def classify(x: str) -> str:
    sports_keywords = ["baseball", "soccer", "football", "tennis"]
    if any(keyword in x for keyword in sports_keywords):
        return "sports"
    else:
        return "other"
5.0.2 Prompting
Prompting uses a language model without training:
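Since no specific model API is given here, the sketch below shows the shape of a zero-shot prompt; `call_llm` is a hypothetical stand-in for any hosted language-model call, not a real library function:

```python
# Sketch of zero-shot prompting; `call_llm` is a hypothetical placeholder,
# not a real API. A real system would send the prompt to a hosted model.
def build_prompt(text: str) -> str:
    return (
        "Classify the topic of the following text as 'sports' or 'other'.\n"
        f"Text: {text}\nTopic:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder response logic standing in for a model's completion
    return "sports" if "soccer" in prompt else "other"

print(call_llm(build_prompt("The soccer match went to penalties.")))
# sports
```

The key point is that no parameters are trained: all task specification lives in the prompt text.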
5.0.3 Fine‑tuning
Fine‑tuning trains a model on paired examples (x, y):
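Fine-tuning proper updates a pretrained model's weights, but the core idea of learning from (x, y) pairs can be sketched framework-free as a perceptron-style update over bag-of-words features. The two training pairs are toy data, illustrative only:

```python
# Perceptron-style learning from paired examples (x, y), y in {+1, -1}.
def features(x: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for word in x.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def train(pairs: list[tuple[str, int]], epochs: int = 10) -> dict[str, float]:
    weights: dict[str, float] = {}
    for _ in range(epochs):
        for x, y in pairs:
            feats = features(x)
            score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
            pred = 1 if score > 0 else -1
            if pred != y:  # update weights only on mistakes
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + y * v
    return weights

w = train([("good great film", 1), ("bad terrible film", -1)])
print(w["good"] > 0, w["terrible"] < 0)  # True True
```

Unlike prompting, the mapping from input to output here is carried entirely by the learned weights.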
6 Data Requirements for System Building
Different approaches require different amounts of data:
- Rules or intuition‑based prompting: no data required
- Spot‑check prompting: small samples of input
- Rigorous evaluation: development and test sets
- Fine‑tuning: large labeled datasets; performance improves with scale
7 Natural Language Processing Pipeline
The NLP pipeline transforms raw text into structured representations and downstream outputs. It typically includes data ingestion, parsing, cleaning, feature engineering, and consumption by models or analytics systems.
7.1 Text Summarization
Text summarization condenses long documents into concise, informative summaries. The process includes preprocessing, feature extraction, sentence ranking, and summary construction.
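The ranking step above can be sketched with a simple frequency heuristic: score each sentence by the summed corpus frequency of its words and keep the top-ranked ones. The three-sentence document is invented for illustration:

```python
# Frequency-based extractive summarization: rank sentences by word frequency.
from collections import Counter

def summarize(text: str, n_sentences: int = 1) -> str:
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    word_freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(sentences,
                    key=lambda s: sum(word_freq[w.lower()] for w in s.split()),
                    reverse=True)
    return '. '.join(ranked[:n_sentences]) + '.'

doc = ("Text mining extracts patterns from text. "
       "Patterns from text support decisions. "
       "The weather was pleasant.")
print(summarize(doc))
# Text mining extracts patterns from text.
```

This is an extractive approach; abstractive summarizers instead generate new sentences, typically with neural models.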
7.2 Core NLP Tasks
NLP encompasses a wide range of tasks, including:
- Part‑of‑speech tagging
- Text segmentation
- Word sense disambiguation
- Handling syntactic ambiguity
- Speech acts
- Question answering
- Summarization
- Natural language generation and understanding
- Machine translation
- Speech recognition and text‑to‑speech
- OCR
- Text proofing
8 Classical vs. Deep Learning NLP
Classical NLP pipelines rely on hand‑crafted features and explicit linguistic processing stages, whereas deep learning approaches learn representations directly from data, trading interpretability for accuracy and flexibility.
9 Sentiment Classification
Sentiment analysis determines whether text expresses positive, negative, or neutral sentiment.
9.1 Text Information
def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    with open(filename, 'r') as f:
        for line in f:
            label, text = line.strip().split(' ||| ')
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data
x_train, y_train = read_xy_data('./data/sentiment-treebank/train.txt')
x_test, y_test = read_xy_data('./data/sentiment-treebank/dev.txt')
print("Document:-", x_train[0])
print("Label:-", y_train[0])
Document:- The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
Label:- 1
9.2 Segmentation, Tokenization, and Cleaning
def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split(' ')
    # Count the number of "good words" and "bad words" in the text
    good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
    bad_words = ['hate', 'bad', 'terrible',
                 'disappointing', 'sad', 'lost', 'angry']
    for x_word in x_split:
        if x_word in good_words:
            features['good_word_count'] = features.get('good_word_count', 0) + 1
        if x_word in bad_words:
            features['bad_word_count'] = features.get('bad_word_count', 0) + 1
    # The "bias" value is always one, to allow us to assign a "default" score to the text
    features['bias'] = 1
    return features
feature_weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}
9.3 Decision Algorithm
def run_classifier(x: str) -> int:
    score = 0
    for feat_name, feat_value in extract_features(x).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0
def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(x)
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)
9.4 Results
label_count = {}
for y in y_test:
if y not in label_count:
label_count[y] = 0
label_count[y] += 1
print(label_count)
train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)
print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')
# Display to 4 decimal places
print(f'Train accuracy: {train_accuracy:.4f}')
print(f'Dev/test accuracy: {test_accuracy:.4f}')
{1: 444, 0: 229, -1: 428}
Train accuracy: 0.4345739700374532
Dev/test accuracy: 0.4214350590372389
Train accuracy: 0.4346
Dev/test accuracy: 0.4214
9.5 Model Evaluation
import random
def find_errors(x_data, y_data):
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(x))
        if y != y_preds[-1]:
            error_ids.append(i)
    for _ in range(5):
        my_id = random.choice(error_ids)
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')
find_errors(x_train, y_train)
`` Freaky Friday , '' it 's not .
true label: -1
predicted label: 1
-LRB- Screenwriter -RRB- Pimental took the Farrelly Brothers comedy and feminized it , but it is a rather poor imitation .
true label: 0
predicted label: 1
... this is n't even a movie we can enjoy as mild escapism ; it is one in which fear and frustration are provoked to intolerable levels .
true label: -1
predicted label: 1
The movie itself appears to be running on hypertime in reverse as the truly funny bits get further and further apart .
true label: -1
predicted label: 1
But it also comes with the laziness and arrogance of a thing that already knows it 's won .
true label: -1
predicted label: 1
9.6 Improving the Model
A typical improvement loop:
- Diagnose errors
- Modify features or scoring
- Measure improvements
- Iterate
- Evaluate on test data
10 Linguistic Barriers
Challenges include:
- Low‑frequency words
- Conjugation
- Negation
- Metaphor
- Analogy
- Symbolic language
Consider how feature engineering or modern embeddings can address these issues.
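One classic feature-engineering answer to negation is to mark the tokens that follow a negator, so that "not good" and "good" produce different features. The negator list and marking window below are illustrative choices:

```python
# Mark tokens within a small window after a negator with a NOT_ prefix.
NEGATORS = {"not", "n't", "never", "no"}

def mark_negation(tokens: list[str], window: int = 3) -> list[str]:
    marked, countdown = [], 0
    for tok in tokens:
        if tok.lower() in NEGATORS:
            marked.append(tok)
            countdown = window          # mark the next few tokens
        elif countdown > 0:
            marked.append("NOT_" + tok)
            countdown -= 1
        else:
            marked.append(tok)
    return marked

print(mark_negation("this is not a good movie".split()))
# ['this', 'is', 'not', 'NOT_a', 'NOT_good', 'NOT_movie']
```

With this transform, a lexicon-based classifier can assign `NOT_good` a negative weight even though `good` is positive. Embedding-based models handle such context implicitly.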
11 Probabilistic Topic Modeling
Topic modeling uncovers latent themes in large text corpora.
11.1 Machine Learning Foundations
Machine learning aims to estimate a function that predicts labels from text.
The function may be linear or nonlinear, hand‑crafted or learned from data.
11.2 Bag of Words Approach
Bag of Words (BoW) represents text as unordered collections of word counts.
11.3 Why BoW Matters
- Converts text into fixed‑length numeric vectors
- Simple, interpretable, and effective for many tasks
- Ignores word order but preserves frequency
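The points above can be seen in one line of standard-library Python: a BoW vector is just a word-count mapping, and `collections.Counter` builds it directly. The example sentence is invented for illustration:

```python
from collections import Counter

# A bag-of-words representation is simply a word -> count mapping
bow = Counter("the movie was good and the acting was good".split())
print(bow["good"], bow["the"], bow["was"])  # 2 2 2
print(sorted(bow))  # the vocabulary, in alphabetical order
```

Note that any reordering of the sentence produces exactly the same counts, which is both the simplicity and the weakness of BoW.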
11.4 Text Cleaning
Original: Despite suffering a sense-of-humour failure...
Cleaned: despite suffering a sense of humour failure...
import random
def sample_sentences(x, y, n=4, seed=42):
    random.seed(seed)
    idx = random.sample(range(len(x)), n)
    return [(y[i], x[i]) for i in idx]
samples = sample_sentences(x_train, y_train, n=4)
for i, (label, text) in enumerate(samples, 1):
    print(f"S{i} [label={label}]: {text}")
S1 [label=1]: With Dickens ' words and writer-director Douglas McGrath 's even-toned direction , a ripping good yarn is told .
S2 [label=0]: Maybe Thomas Wolfe was right : You ca n't go home again .
S3 [label=-1]: Despite suffering a sense-of-humour failure , The Man Who Wrote Rocky does not deserve to go down with a ship as leaky as this .
S4 [label=1]: It will guarantee to have you leaving the theater with a smile on your face .
Cleaning typically includes lowercasing, removing punctuation, and optional stopword removal.
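These cleaning steps can be sketched with the standard library alone; the tiny stopword set below is an illustrative sample, not a standard list:

```python
import string

STOPWORDS = {"a", "the", "of", "to", "and"}  # illustrative sample only

def clean(text: str, remove_stopwords: bool = False) -> str:
    # Lowercase, then map every punctuation character to a space
    text = text.lower().translate(
        str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return ' '.join(tokens)

print(clean("Despite suffering a sense-of-humour failure..."))
# despite suffering a sense of humour failure
```

This reproduces the Original/Cleaned example shown earlier in this section.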
11.5 Tokenization and CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
docs = [text for _, text in samples]
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words=None  # keep everything for teaching clarity
)
X = vectorizer.fit_transform(docs)
bow_df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"S{i+1}" for i in range(len(docs))]
)
bow_df.iloc[:, 0:8]
|   | again | and | as | ca | deserve | despite | dickens | direction |
|---|---|---|---|---|---|---|---|---|
| S1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| S2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| S3 | 0 | 0 | 2 | 0 | 1 | 1 | 0 | 0 |
| S4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11.6 Vocabulary, DTM, and Word Frequencies
BoW produces a document‑term matrix (DTM) where rows are documents and columns are word counts.
12 Strengths and Limitations of BoW
12.1 Strengths
- Simple
- Fast
- Effective for short, structured text
12.2 Limitations
- Ignores order and meaning
- Sparse representations
- Vocabulary explosion
12.3 BoW in Practice (Sentiment Analysis)
12.4 When to Use Bag of Words
Use BoW when:
- Data is small or medium
- Interpretability matters
- You need a fast baseline
Avoid BoW when:
- Documents are long
- Semantic nuance matters
- Context is essential
12.5 Key Takeaways
- Bag of Words is fundamentally about counting, not understanding
- It is a stepping stone to TF‑IDF, embeddings, and transformers
- Representation is the foundation of all NLP
- Generative and discriminative models offer complementary perspectives
2.4 Social Analytics
Social analytics focuses on understanding digital interactions and relationships across social platforms. As individuals, organizations, and communities increasingly communicate online, social analytics provides the tools to measure influence, detect trends, and interpret collective behavior.
Social analytics is commonly defined as the monitoring, analysis, measurement, and interpretation of digital interactions and relationships among people, topics, and ideas.
Social analytics encompasses two major subfields: social network analysis (SNA) and social media analytics (SMA).
The diagram below captures this conceptual division:
2.4.1 Social Network Analysis (SNA)
SNA examines how individuals or entities are connected. It uses graph theory to analyze nodes (people, organizations) and edges (relationships, interactions). Key metrics include centrality, density, modularity, and community structure. SNA is used in fields ranging from epidemiology to marketing and organizational behavior.
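Two of the metrics named above, degree centrality and density, can be computed with the `networkx` library on a toy graph (the four-person edge list is invented for illustration):

```python
import networkx as nx

# Toy social graph: nodes are people, edges are relationships
G = nx.Graph([("Ana", "Bo"), ("Ana", "Cy"), ("Ana", "Di"), ("Bo", "Cy")])

centrality = nx.degree_centrality(G)  # fraction of other nodes each node touches
print(max(centrality, key=centrality.get))  # Ana
print(round(nx.density(G), 2))              # 0.67
```

Ana has the highest degree centrality because she is connected to everyone else; the density of 0.67 means two thirds of all possible edges are present.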
2.4.2 Social Media Analytics (SMA)
SMA focuses on the content and interactions occurring on social platforms. It includes sentiment analysis, trend detection, topic modeling, engagement measurement, and influencer identification. SMA helps organizations understand public opinion, track brand perception, and respond to emerging issues in real time.